R Markdown weaves together narrative text and code (see code chunk below), improving the readability/reproducibility of your code. For more details on using R Markdown refer to this quick tour or this website. For syntax in R Markdown refer to this cheat sheet.
Note that you may need to install the rmarkdown package if
you have not done so already. Use:
install.packages("rmarkdown")
This is an R Notebook file (see the YAML header:
output: html_notebook). An R Notebook is an R Markdown
document with chunks that can be executed independently and
interactively, with output visible immediately beneath the input or in
the console. This makes an R Notebook a good choice while
drafting an R Markdown document. When you save this R Markdown document,
an HTML file is generated automatically. When you are ready to
publish the document, you can render it to a publication format (e.g.,
HTML, Word, PDF) with the Knit button. For more about R Notebooks, see
this book by Xie and
colleagues.
# This is an example of an R code chunk.
# For this session, please only edit code within R code chunks like this, focusing on the missing pieces shown with "___".
For this session, we will mostly rely on the following packages.
Make sure these packages are installed; if not, use the function
install.packages() to download and install them.
Note that additional packages will be required for specific sections. We will introduce and load them when needed.
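One handy pattern, sketched here, is to install only the packages that are not yet present (the vector below lists this session's core packages):

```r
# install any required packages that are not yet present
pkgs <- c("tidyverse", "here", "zoo", "lubridate")
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```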
# load packages
library(tidyverse)   # ggplot2, dplyr, readr, and friends
library(ggplot2)     # already attached by tidyverse; kept here for emphasis
library(here)
library(zoo)
library(lubridate)
It’s always good practice to inspect our working directory. This
can be done using getwd(). We can also manually set our
working directory using setwd().
here::here() is a nifty function that simplifies writing
file paths. It also pairs well with RStudio Projects. We will use it repeatedly
in this session.
# inspect working directory
getwd()
## [1] "/Users/xiuqili/Documents/GitHub/DataClub_2022Fall/Session-1/code"
# inspect here::here()
here::here()
## [1] "/Users/xiuqili/Documents/GitHub/DataClub_2022Fall"
In this part, we will recreate the New York Times “New reported cases” graph.
We can use read_csv() to import data directly from
GitHub.
# import data directly from GitHub
US_data <- read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us.csv")
## Rows: 973 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): cases, deaths
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# inspect data briefly
str(US_data)
## spec_tbl_df [973 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ date : Date[1:973], format: "2020-01-21" "2020-01-22" ...
## $ cases : num [1:973] 1 1 1 2 3 5 5 5 5 6 ...
## $ deaths: num [1:973] 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. date = col_date(format = ""),
## .. cases = col_double(),
## .. deaths = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
summary(US_data)
## date cases deaths
## Min. :2020-01-21 Min. : 1 Min. : 0
## 1st Qu.:2020-09-20 1st Qu.: 6825950 1st Qu.: 199367
## Median :2021-05-21 Median :33099318 Median : 588850
## Mean :2021-05-21 Mean :37659506 Mean : 541669
## 3rd Qu.:2022-01-19 3rd Qu.:68557295 3rd Qu.: 857644
## Max. :2022-09-19 Max. :95514998 Max. :1050282
The GitHub dataset we imported in Step 2a contains the cumulative number of cases and deaths nationwide. We will need to calculate:

- daily new cases, using dplyr::lag()
- the 7-day rolling average of new cases, using zoo::rollmeanr()

The New York Times “New reported cases” graph shows daily new cases as bars, overlaid with the 7-day average as a line.
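As a sketch of those two calculations on a small made-up cumulative series (the toy object and column names here are illustrative, not the session's actual ones):

```r
library(dplyr)
library(zoo)

# toy cumulative case counts standing in for US_data$cases
toy <- tibble(cases = c(10, 12, 15, 15, 20, 26, 33, 41))

toy <- toy %>%
  mutate(
    # daily new cases: today's cumulative total minus yesterday's
    cases.daily = cases - lag(cases, default = 0),
    # right-aligned 7-day rolling average; NA until 7 days are available
    cases.daily.ave = rollmeanr(cases.daily, k = 7, fill = NA)
  )
```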
We will use ggplot to recreate this graph.
facet_wrap() and facet_grid() can be used
to easily create multi-panel plots. In this section, we will use
facet_wrap() to graph the rolling 7-day average of new
COVID cases per 100k people for each state.
This will require several steps:

- calculate, for each state, daily new cases and the rolling 7-day average per 100,000 people, using group_by() followed by mutate()
- plot each state in its own panel, using facet_wrap()
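A sketch of those steps on a tiny made-up dataset (the object and column names are assumptions; with the real data you would use k = 7 and each state's census population):

```r
library(dplyr)
library(zoo)
library(ggplot2)

# toy per-state cumulative counts; all names and numbers are made up
State_toy <- tibble(
  state      = rep(c("A", "B"), each = 5),
  day        = rep(1:5, times = 2),
  cases      = c(1, 3, 6, 10, 15,  2, 2, 5, 9, 14),
  population = rep(c(1e6, 2e6), each = 5)
)

State_toy <- State_toy %>%
  group_by(state) %>%          # calculations restart for each state
  mutate(
    cases.daily             = cases - lag(cases, default = 0),
    cases.daily.ave         = rollmeanr(cases.daily, k = 3, fill = NA),
    cases.daily.ave.density = cases.daily.ave / population * 1e5
  ) %>%
  ungroup()

# one panel per state
ggplot(State_toy, aes(x = day, y = cases.daily.ave.density)) +
  geom_line() +
  facet_wrap(~ state)
```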
For fun: the geofacet package works similarly to
facet_wrap(), except it arranges panels following a grid
that mimics the original geographic layout. For more information on geofacet, see
here.
# load library
library(geofacet)
# use facet_geo() instead of facet_wrap()
ggplot(data = State_daily_graph,
mapping = aes(x = date, y = cases.daily.ave.density)) +
geom_line(color = "grey45") +
facet_geo(~ state) +
theme_bw() +
theme(panel.grid = element_blank()) +
labs(title = "New reported COVID-19 cases in different states",
subtitle = "Average daily cases per 100,000 people",
caption = "Data source: New York Times. Retrieved from GitHub.")
# export graph
ggsave(filename = "State_AverageDailyCasesPer100k_GeoWrap.png",
path = here::here("Session-1", "graphs"),
width = 14, height = 8, units = "in", bg = "white")
Let’s investigate the relationship between having a college/university in the county and masking rates within that county.
The New York Times posted data from a survey carried out in July 2020, during the second peak of the pandemic. Data from this survey were used to predict mask-wearing in different counties. Data and metadata are available on the New York Times GitHub. See here.
The us-colleges-and-universities.csv file in the data folder lists colleges, universities, and professional schools within the US. This .csv file was derived from this dataset on US Colleges and Universities and has been simplified to speed up today’s analysis.
The county_fips.txt file in the data folder lists counties in the US, states they belong to, and their fips code. This was accessed from the USDA website.
For the analysis, we will need to:

- Calculate the chance of encountering someone wearing a mask in each county: NEVER * 0 + RARELY * 0.25 + SOMETIMES * 0.5 + FREQUENTLY * 0.75 + ALWAYS * 1
- Wrangle data and combine the datasets
- Create a violin + dot plot to compare masking rates in counties with colleges/universities to those without
- Perform a t-test
- Create an interactive plot
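The weighted sum in the first step might look like this on a couple of made-up rows shaped like the survey data (the column names come from the survey's metadata; the values and object name are invented):

```r
library(dplyr)

# two toy counties with invented response proportions
masks_toy <- tibble(
  COUNTYFP   = c("01001", "01003"),
  NEVER      = c(0.05, 0.10),
  RARELY     = c(0.07, 0.10),
  SOMETIMES  = c(0.12, 0.20),
  FREQUENTLY = c(0.20, 0.25),
  ALWAYS     = c(0.56, 0.35)
)

# chance of encountering someone wearing a mask, per county
masks_toy <- masks_toy %>%
  mutate(chances = NEVER * 0 + RARELY * 0.25 + SOMETIMES * 0.5 +
                   FREQUENTLY * 0.75 + ALWAYS * 1)
```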
We can use t.test() to compare the two groups. Review the
documentation of t.test() to
understand the assumptions made in this statistical analysis.##
## Welch Two Sample t-test
##
## data: chances by college.county
## t = -15.959, df = 1639.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
## -0.06998475 -0.05466467
## sample estimates:
## mean in group No mean in group Yes
## 0.7277745 0.7900992
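The call that produces output of this shape looks roughly like the following (the data frame here is simulated; the session's actual object name may differ):

```r
# simulated data with the same structure: masking rate per county, plus a
# flag for whether the county has a college/university
set.seed(1)
county_toy <- data.frame(
  chances        = c(rnorm(50, mean = 0.73, sd = 0.05),
                     rnorm(50, mean = 0.79, sd = 0.05)),
  college.county = rep(c("No", "Yes"), each = 50)
)

# Welch two-sample t-test (R's default, var.equal = FALSE)
t.test(chances ~ college.county, data = county_toy)
```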
We want to note here that Steps 4a and 4b are not meant to be a scientific analysis of COVID-related data; we are using them to demonstrate data wrangling and statistical analysis in R.
In this step, we will turn the violin/dot plot above into an
interactive plot. This will require the package plotly. For more about
plotly, specifically ggplotly() refer to this website.
We can display interactive plots as part of this R Markdown file, or export it as an independent html file. We will need the package htmlwidgets to export the interactive plot.
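A minimal sketch of both routes, using a built-in dataset in place of the session's violin/dot plot (the plot contents and the output file name are arbitrary):

```r
library(ggplot2)
library(plotly)
library(htmlwidgets)

# a stand-in ggplot; with the real data this would be the violin + dot plot
p <- ggplot(mtcars, aes(x = factor(am), y = mpg)) +
  geom_violin() +
  geom_jitter(width = 0.1)

# convert to an interactive plot; printing p_interactive displays it
# inline in the notebook
p_interactive <- ggplotly(p)

# export as a standalone html file
# (selfcontained = TRUE bundles everything into one file but needs pandoc)
saveWidget(p_interactive, file = "interactive_plot.html",
           selfcontained = FALSE)
```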
In this step, we will create an animated map showing 7-day rolling average of new cases in each state (normalized to state population). We will start by creating a static map.
The maps package is a handy source of geospatial data in R, and it’s great for making basic maps. For more on creating maps using the maps package, see this webpage.
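For example, a bare-bones static map of the lower 48 states might be sketched as:

```r
library(ggplot2)
library(maps)

# state outlines as a data frame of polygon vertices
# (columns include long, lat, group, region)
us_states <- map_data("state")

ggplot(us_states, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "grey90", color = "white") +
  coord_quickmap() +   # keeps the aspect ratio roughly correct
  theme_void()
```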
We will now build upon the static map and create an animated map showing 7-day rolling average of new cases in each state (normalized to state population).
The gganimate package builds upon the grammar of ggplot and provides functions for animation. For a gganimate cheat sheet, see [here](https://ugoproto.github.io/ugo_r_doc/pdf/gganimate.pdf).
We’ll use the package gifski to save the animated map we generated as a GIF.
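Putting the pieces together as a sketch — the real session animates the wrangled per-state case data, so the fill values below are random placeholders:

```r
library(dplyr)
library(ggplot2)
library(maps)
library(gganimate)

# state polygons joined with made-up daily values (placeholders for the
# rolling-average case densities)
us_states <- map_data("state")
set.seed(1)
toy_values <- expand.grid(region = unique(us_states$region), day = 1:3,
                          stringsAsFactors = FALSE) %>%
  mutate(value = runif(n()))

anim_data <- left_join(us_states, toy_values, by = "region")

anim <- ggplot(anim_data,
               aes(x = long, y = lat, group = group, fill = value)) +
  geom_polygon(color = "white") +
  coord_quickmap() +
  transition_time(day) +            # one frame per day
  labs(title = "Day: {frame_time}")

# render and save as a gif using the gifski renderer
animate(anim, nframes = 6, renderer = gifski_renderer("toy_map.gif"))
```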
Congrats on reaching the last step! We hope this session has been interesting & informative.
Adding a date stamp and exporting the session info are great practices to ensure the reproducibility of your code. The next code chunk helps you do both.
# Add date stamp
lubridate::stamp("Data updated December 31, 1979")(lubridate::now())
## Multiple formats matched: "Data updated %Om %d, %Y"(1), "Data updated %B %d, %Y"(1)
## Using: "Data updated %B %d, %Y"
## [1] "Data updated September 20, 2022"
# Add session info
# note, the [-8] omits packages installed, but not loaded for this analysis
utils:::print.sessionInfo(sessionInfo()[-8])
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] gifski_1.6.6-1 gganimate_1.0.8 maps_3.4.0 htmlwidgets_1.5.4
## [5] plotly_4.10.0 geofacet_0.2.0 lubridate_1.8.0 zoo_1.8-11
## [9] here_1.0.1 forcats_0.5.2 stringr_1.4.1 dplyr_1.0.10
## [13] purrr_0.3.4 readr_2.1.2 tidyr_1.2.1 tibble_3.1.8
## [17] ggplot2_3.3.6 tidyverse_1.3.2
Rerunning everything at the very end also helps ensure that your code is reproducible. To do so:
Clear objects from the workspace by going to Session > Clear Workspace… This allows you to run the code in this file without interference from prior runs, thus testing the reproducibility of your code.
Open the “Run” dropdown menu (upper right of the Script Editor) and click “Run All”. This runs all the code in this document.
Once you hear the Mario tune, click the “Save” button (upper left of the Script Editor). This should automatically generate/update an HTML file that you can share with a colleague.
Time to celebrate with some Mario tunes!
# B/c Mario is awesome
library(beepr)
beep(sound = 8)